1. INTRODUCTION

1.1 The Sorting Problem on a Network

This paper concerns the problem of parallel sorting on a fixed
connection network $G = (V,E)$ of $N$ nodes $V = \{v_0, \ldots, v_{N-1}\}$.
Each node $v_i \in V$ is initially input a set $X_i$ of $c_0$ distinct keys.
Thus the set $X = X_0 \cup \cdots \cup X_{N-1}$ of all keys input is of size
$c_0 N$. We assume a relation $<$ totally ordering $X$. The network sorts $X$
by routing each key $x \in X$ to node $v_j$, where
$j = \lfloor \mathrm{rank}(x)/c_0 \rfloor$ and
$\mathrm{rank}(x) = |\{x' \in X \mid x' < x\}|$. Thus each sorting problem on
network $G$ can be viewed as a distributed routing problem. Each key $x \in X$
is considered a packet which must be routed from its initial location $v_i$,
where $x \in X_i$, to the destination node $v_j$, where the index
$j = \lfloor \mathrm{rank}(x)/c_0 \rfloor$ must also be computed distributively.
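
As a small illustration of the definitions above, the following Python sketch computes $\mathrm{rank}(x)$ and the destination index $j = \lfloor \mathrm{rank}(x)/c_0 \rfloor$ directly and centrally. The function name `destinations` and the dictionary layout are assumptions made for this example only; the point of the paper is that the same quantity must be computed distributively.

```python
def destinations(keys, c0):
    """Map each key x to j = floor(rank(x) / c0), where
    rank(x) = |{x' in X : x' < x}| (keys are assumed distinct)."""
    ordered = sorted(keys)
    rank = {x: r for r, x in enumerate(ordered)}   # 0-based rank, as in Section 1.1
    return {x: rank[x] // c0 for x in keys}
```

With $c_0 = 2$ and six keys, each of the three destination nodes receives exactly two keys.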

1.2 Assumptions

We assume each node $v \in V$ contains a single sequential processor
with local storage for $O(\log N)$ packets. These processors execute
synchronously. At a single step each processor may make an elementary
operation such as a key comparison, and cause transmission of a packet
across each departing edge. Each processor uses randomization in the sense
of Rabin [76] and Solovay and Strassen [77]. It is allowed on each step to
choose random bits independently of the probabilistic choices of any other
processor.

With these assumptions, the routing required to sort on network $G$
requires time at least its diameter
$\max_{u,v \in V} \{d \mid d$ is the length of the shortest path from $u$ to $v\}$.
If $G$ has constant valence, then $G$ has diameter $\Omega(\log N)$. Thus
$\Omega(\log N)$ is a lower bound for the time to sort on any constant
valence network of $N$ processors.
1.3 FLASHSORT

Our FLASHSORT algorithm for parallel sorting uses a probabilistic
divide and conquer technique similar to the popular sequential sorting
algorithm QUICKSORT of Hoare [62]. QUICKSORT is popular because it can
be practically implemented and is very fast. Sedgewick [75] shows that
QUICKSORT takes expected time less than $cN \log N$ for a small constant $c$,
given $N$ randomly permuted input keys. FLASHSORT has similar advantages
which we feel will lead to its practical utilization on distributed
networks.

FLASHSORT is executed in four phases, sketched below:

I. (Random Routing) Route each key $x \in X$ to a randomly chosen node.

II. (Splitter Directed Routing) Choose a random key $\sigma \in X$. Use it
to split $X - \{\sigma\}$ into disjoint subsets $\{x \in X \mid x < \sigma\}$
and $\{x \in X \mid \sigma < x\}$. Route these two disjoint subsets to
disjoint subnetworks, and recursively apply II. This gives a rough sort of
the keys $X$ into disjoint subsets of small cardinality.

III. (Rank Computation) Compute $\mathrm{rank}(x)$ for each key $x \in X$.

IV. (Rank Directed Routing) Route each $x \in X$ to node $v_j$, where
$j = \lfloor \mathrm{rank}(x)/c_0 \rfloor$.
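
The four phases can be mimicked by a purely sequential sketch, which may help fix the overall structure before the network details are introduced. Everything here is an illustrative assumption: the names `rough_sort` and `flashsort_sketch`, the use of a shuffle as a stand-in for phase I, and the centralized computation of ranks. No routing, queueing, or timing behaviour of the network is modelled.

```python
import random

def rough_sort(keys, depth, rng):
    """Phase II analogue: split recursively around a random splitter for
    `depth` levels. Returns buckets in increasing key order; each chosen
    splitter becomes a singleton bucket."""
    if not keys:
        return []
    if depth == 0:
        return [list(keys)]
    sigma = rng.choice(keys)
    low = [x for x in keys if x < sigma]
    high = [x for x in keys if x > sigma]
    return rough_sort(low, depth - 1, rng) + [[sigma]] + rough_sort(high, depth - 1, rng)

def flashsort_sketch(keys, c0, depth, seed=0):
    """Return {key: destination index} for distinct keys."""
    rng = random.Random(seed)
    keys = list(keys)
    rng.shuffle(keys)                        # Phase I stand-in: random placement
    buckets = rough_sort(keys, depth, rng)   # Phase II: rough sort
    dest, rank = {}, 0
    for bucket in buckets:                   # Phase III: ranks from bucket order
        for x in sorted(bucket):
            dest[x] = rank // c0             # Phase IV: destination index
            rank += 1
    return dest
```

Because the buckets produced by the recursion are already ordered relative to one another, only the small buckets need a local sort before ranks can be assigned.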

1.4 Organization and Results

Section 2 defines the CCC network and a related network CCC+ on
which we implement FLASHSORT. Section 3 describes the random routing in
phase I and a probabilistic strategy for routing in phase IV proposed by
Valiant and Brebner [81]. Section 4 gives the details of the
splitter-directed routing in phase II. Section 5 describes the rank
computation in phase III. Sections 6-11 provide a probabilistic analysis of
phases II and III; we show that for any $c$, $c_0$ above a small number,
$\exists c' > 1$ such that phases II and III take time less than $c \log N$
with probability greater than $1 - N^{-c'}$. Similar bounds on time have been
proved by Aleliunas [82] and Upfal [82] for the packet routing in phases I
and IV. Thus we conclude that on any given input, FLASHSORT achieves
asymptotically optimal time $c \log N$, for a small constant $c$, on all but
a vanishingly small fraction of executions. Section 12 describes some
modifications to FLASHSORT which make it more practical when $c_0$ is small.


1.5 Previous Work

All previous algorithms for sorting $N$ keys on a constant valence
fixed connection network of $N$ processors require time $\Omega(\log^2 N)$.
The parallel sorting algorithm of Batcher [68] achieves this time bound on
various $N$ node constant valence networks such as the CCC of Preparata and
Vuillemin [81].

For less realistic models of computation faster algorithms are known.
Several years ago J. Wiedermann observed that QUICKSORT takes time $c \log N$
with high likelihood on a parallel decision tree model with $N$ processors.
Reischuk [81] has a similar result for parallel random access machines.

Our algorithm follows the randomized routing ideas introduced in
Valiant [82]. In the proofs heavy use is made of the critical path
technique developed by Aleliunas [82] and Upfal [82].


2. NETWORK DEFINITIONS

Fix some number $n \ge 1$. This section defines two constant valence
networks derived from the hypercube of dimension $n$. These networks have
the same node set
$V = \{(\alpha, i) \mid \alpha \in \{0,1\}^n,\ i \in \{0, \ldots, n-1\}\}$
with cardinality $N = n 2^n$. For each $v \in V$ let
$\mathrm{address}(v) = \alpha$ and $\mathrm{stage}(v) = i$ where
$v = (\alpha, i)$. Let $\alpha[i]$ be the $i$-th bit of $\alpha$ and let
$\alpha' = \mathrm{EXT}(\alpha, i)$ be identical to $\alpha$ except
$\alpha'[i]$ is the complement of $\alpha[i]$. Let edge
$(u,v) \in V \times V$ be internal if
$\mathrm{address}(v) = \mathrm{address}(u)$ or external if
$\mathrm{address}(v) = \mathrm{EXT}(\mathrm{address}(u), \mathrm{stage}(u)+1)$.
Also, let $(u,v)$ be forward if
$\mathrm{stage}(v) = (\mathrm{stage}(u) + 1) \bmod n$, static if
$\mathrm{stage}(v) = \mathrm{stage}(u)$, or reverse if
$\mathrm{stage}(v) = (\mathrm{stage}(u) - 1) \bmod n$.

The Cube Connected Cycles ($\mathrm{CCC}_n$) network of Preparata and
Vuillemin [81] has node set $V$ and exactly all forward-internal edges,
reverse-internal edges, and static-external edges. For technical reasons this
paper will assume a network previously defined in Upfal [82] which we
call the $\mathrm{CCC}_n^+$ network. It has node set $V$ and exactly all
forward-internal edges, reverse-internal edges, and forward-external edges.
Thus the $\mathrm{CCC}_n$ and $\mathrm{CCC}_n^+$ differ only with respect to
the stage portions of external edges. Clearly any algorithm requiring routing
on the $\mathrm{CCC}_n^+$ network can be simulated on the $\mathrm{CCC}_n$
network with at most a factor of 2 time increase. (Since the transmission in
$\mathrm{CCC}_n^+$ of a packet $x$ across a forward-external edge $(u,v)$ can
be simulated in $\mathrm{CCC}_n$ by transmission of $x$ across
forward-internal edge $(u,w)$ followed by static-external edge $(w,v)$.)

Note the $\mathrm{CCC}_n$ and $\mathrm{CCC}_n^+$ are both naturally related
to the hypercube $H_n$ of dimension $n$. Intuitively, for each
$\alpha \in \{0,1\}^n$ the set of nodes
$\{u \in V \mid \mathrm{address}(u) = \alpha\}$ can be considered to be a
"supernode" of $H_n$. Each such "supernode" is connected by external edges
to $n$ other "supernodes"
$\{v \in V \mid \mathrm{address}(v) = \mathrm{EXT}(\alpha, i)\}$ for
$i = 0, \ldots, n-1$.
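
The definition of the $\mathrm{CCC}_n^+$ can be made concrete with a short Python sketch that enumerates the nodes and their three departing edges. The representation (addresses as bit tuples, a dictionary of named edges) and the function names `ext` and `ccc_plus` are assumptions of this example. It also assumes that bits are indexed from 1 starting at the left, so that the external edge leaving a node of stage $i$ flips bit $i+1$.

```python
from itertools import product

def ext(addr, i):
    """EXT(addr, i): complement the i-th bit (1-indexed, an assumption here)."""
    flipped = list(addr)
    flipped[i - 1] ^= 1
    return tuple(flipped)

def ccc_plus(n):
    """Return {node: {edge type: neighbour}} for the CCC+ on n * 2**n nodes.
    A node is a pair (address, stage) with address a tuple of n bits."""
    edges = {}
    for addr in product((0, 1), repeat=n):
        for stage in range(n):
            nxt = (stage + 1) % n
            edges[(addr, stage)] = {
                "forward-internal": (addr, nxt),
                "reverse-internal": (addr, (stage - 1) % n),
                "forward-external": (ext(addr, stage + 1), nxt),
            }
    return edges
```

Every node has exactly three departing edges regardless of $n$, which is the constant valence property used throughout.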
3. PACKET ROUTING ON THE $\mathrm{CCC}_n^+$

This section briefly describes the probabilistic packet routing
algorithm of Valiant and Brebner [81] as applied to the $\mathrm{CCC}_n^+$ by
Upfal [82]. This routing is required to implement phases I and IV of
FLASHSORT.

We require that each node $v \in V$ contain for each departing edge $e$
a queue $Q_e$ of packets to be transmitted across edge $e$. Each node also
contains its address and stage posted as local variables.

Let $X$ be the set of $c_0 N$ packets to be routed, where each packet
$x \in X$ is initially at a given node $I_x \in V$ and we wish $x$ to be
routed to given destination node $D_x \in V$. The algorithm has two phases:

A. (Random Routing) route $x$ from $I_x$ to a node $R_x \in V$ with random
address.

B. (Fixed Destination Routing) route $x$ from $R_x$ to $D_x$.

The random routing of $x$ in Phase A is accomplished by repeating
for $n$ stages the transmission of $x$ across a randomly chosen departing
forward edge (i.e., transmit $x$ across the forward-internal edge or
forward-external edge with equal probability). Phase B repeats for $n$
stages the following: if $x$ is currently at node $v \ne D_x$ with
$j = \mathrm{stage}(v) + 1$ and
$\mathrm{address}(v)[j] = \mathrm{address}(D_x)[j]$, then $x$ is transmitted
across the forward-internal edge departing $v$ and otherwise $x$ is
transmitted across the forward-external edge departing $v$. This takes the
packets to the correct addresses. Finally, they are pipelined to the correct
nodes in the cycles.
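
The path taken by a single packet through Phases A and B can be traced with the following Python sketch. It is a simplification under stated assumptions: queueing and contention are ignored, Phase B always runs for the full $n$ stages rather than stopping when the packet reaches $D_x$, and the function name `route_packet` and the tuple encoding of nodes are invented for this example. Bit $j$ (1-indexed) is stored at list index $j-1$.

```python
import random

def route_packet(start, dest, n, rng):
    """Trace one packet from `start` to `dest`; nodes are (address, stage)
    with address a tuple of n bits. Returns the list of nodes visited."""
    addr, stage = list(start[0]), start[1]
    path = [start]

    def step(use_external):
        nonlocal stage
        if use_external:
            addr[stage] ^= 1            # forward-external flips bit stage+1
        stage = (stage + 1) % n         # both forward edges advance the stage
        path.append((tuple(addr), stage))

    for _ in range(n):                  # Phase A: random forward edges
        step(rng.random() < 0.5)
    for _ in range(n):                  # Phase B: fix bit stage+1 if it differs
        step(addr[stage] != dest[0][stage])
    while stage != dest[1]:             # final walk around the cycle
        step(False)
    return path
```

Since $n$ consecutive forward steps visit every bit position once, Phase B always ends at the destination address, and at most $n-1$ further internal steps reach the destination stage.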

We have not yet specified the management of the queues of packets at
each node. Suppose the priority of packet $x \in X$ is assigned to be the
number of stages of phases A and B so far accomplished, and we allow packet
$x$ to be transmitted from each node $v \in V$ only after all packets of
lower priority are transmitted from $v$. Aleliunas [82] and Upfal [82] show:

THEOREM 3.1. For any $c$ above a small constant number, $\exists c' > 1$
such that the execution time of phases A and B exceeds $cn$ with probability
at most $N^{-c'}$.

4. SPLITTER DIRECTED ROUTING

This section describes the splitter directed routing in phase II of
FLASHSORT. Let $X[\Lambda] = X$ be the set of $c_0 N$ keys input to
FLASHSORT, where $\Lambda$ is the empty string. We index certain subsets of
$X$ by $\{0,1\}^{<n+1}$, the set of binary strings of length at most $n$.
Phase II is executed in stages $i = 0, \ldots, n-1$ where for each
$\beta \in \{0,1\}^i$, if $X[\beta] \ne \emptyset$ we choose a random key
$\sigma[\beta] \in X[\beta]$ which splits $X[\beta] - \{\sigma[\beta]\}$ into
disjoint subsets $X[\beta 0] = \{x \in X[\beta] \mid x < \sigma[\beta]\}$ and
$X[\beta 1] = \{x \in X[\beta] \mid \sigma[\beta] < x\}$. (The intention is to
route the elements of $X[\beta]$ to the subcube specified by $\beta$.)

If $\sigma[\beta]$ is never defined for some $\beta \in \{0,1\}^{<n}$ then
phase II has a blockage and cannot be completed. This event is shown in
Section 8 to have vanishingly low probability. If blockage does occur, then
the execution of phase II will exceed a time limit $c_4 n$ (determined in
Section 11 to hold with high likelihood assuming no blockage) and phases I
and II must be re-executed. The probability of blockage in phase II the next
time is independent of the first event of blockage in phase II. If there is
no blockage, then phase II yields a rough sort of the keys $X$ into a total
of $2^{n+1} - 1$ subsets, where $2^n$ disjoint subsets are of expected size
less than $c_0 n$ with form $X[\alpha]$ where $\alpha \in \{0,1\}^n$ and there
are also $2^n - 1$ singleton sets of form $\{\sigma[\beta]\}$, where
$\beta \in \{0,1\}^{<n}$. Note that if $\alpha, \alpha' \in \{0,1\}^n$ where
string $\alpha$ precedes $\alpha'$ in lexical order, then $x \in X[\alpha]$,
$y \in X[\alpha']$ imply that $x < y$ in the total order.

We also define a recursive subdivision of the node set $V$, where we
index subsets by binary strings in $\{0,1\}^{<n+1}$. For each
$\beta \in \{0,1\}^n$, let
$V[\beta] = \{v \in V \mid \mathrm{address}(v) = \beta$ and
$\mathrm{stage}(v) = 0\}$. For each $\beta \in \{0,1\}^{<n}$ let
$V[\beta] = \{v \in V \mid \beta$ is a prefix of $\mathrm{address}(v)$ and
$|\beta| = \mathrm{stage}(v)\}$. Let the root $r[\beta]$ of $V[\beta]$ be the
node with address $\beta 0^{n-|\beta|}$ and stage $|\beta|$. Note that for
$|\beta| < n-1$, $r[\beta]$ has a departing forward-internal edge entering
$r[\beta 0]$ and also a departing forward-external edge entering
$r[\beta 1]$. Also note that $V[\beta 0]$ and $V[\beta 1]$ are disjoint.
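
The subdivision $V[\beta]$ and its roots can be checked mechanically on a small instance. The sketch below is an illustration only: the function names `subdivision`, `root` and `forward_edges` and the tuple encoding of addresses are assumptions of the example, and bits are taken to be 1-indexed from the left as in Section 2.

```python
from itertools import product

def subdivision(beta, n):
    """V[beta]: nodes whose address has prefix beta, at stage |beta|
    (stage 0 when |beta| = n)."""
    stage = len(beta) % n
    tails = product((0, 1), repeat=n - len(beta))
    return {(beta + tail, stage) for tail in tails}

def root(beta, n):
    """r[beta]: address beta followed by zeros, stage |beta| (mod n)."""
    return (beta + (0,) * (n - len(beta)), len(beta) % n)

def forward_edges(node, n):
    """Return the (forward-internal, forward-external) neighbours of node."""
    addr, stage = node
    flipped = addr[:stage] + (1 - addr[stage],) + addr[stage + 1:]
    nxt = (stage + 1) % n
    return (addr, nxt), (flipped, nxt)
```

For example, with $n = 3$ and $\beta = 1$, the two forward edges leaving $r[1]$ enter $r[10]$ and $r[11]$, and $V[10]$ and $V[11]$ share no node.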

For each $i = 0, \ldots, n$ let
$V_i = \{v \in V \mid \mathrm{stage}(v) = i\}$.

We assume phase I routes each key $x \in X$ to a random node $v \in V_0$
and that the set of keys queued at each $v \in V_0$ are randomly ordered.

The stage of key $x \in X$ executing phase II is the stage of the node
$x$ is currently visiting, except we define the stage of $x$ to be $n$ in the
case $x$ has just completed stage $n-1$ (and so just been routed from a node
of stage $n-1$ to a node of stage 0). In the routing of phase II each of the
keys $x \in X[\beta]$ visits only nodes in $V$ whose address is prefixed by
$\beta$, once $x$ has passed stage $|\beta|$.

Initially each $v \in V[\beta]$ is said to be inactive, and transmits no
keys of stage $|\beta|$. We activate $r[\beta]$ by choosing $\sigma[\beta]$ to
be a particular key in $X[\beta]$ reaching the root node $r[\beta]$ so that
$\sigma[\beta]$ is a random element of $X[\beta]$ (e.g., we may let
$\sigma[\beta]$ be the first key of stage $|\beta|$ reaching $r[\beta]$).
By a method described below, a copy of $\sigma[\beta]$ is routed to each node
$v \in V[\beta]$. Node $v$ becomes activated when this copy reaches it. It is
able to transmit keys of stage $|\beta|$ only after it has been activated.

Let $x \in X[\beta] - \{\sigma[\beta]\}$ be a key at stage $|\beta|$ and
visiting a node $v \in V[\beta]$. The key $x$ remains at node $v$ until $v$
has been activated. When $v$ has been activated, if $x < \sigma[\beta]$ then
$x$ is transmitted across the forward-internal edge departing $v$ to a node
in $V[\beta 0]$, and otherwise if $x > \sigma[\beta]$ then $x$ is transmitted
across the forward-external edge departing $v$ to a node in $V[\beta 1]$.

Thus after all keys in $X$ have completed stage $n-1$, we have for
each $\beta \in \{0,1\}^n$ routed the keys $X[\beta]$ to the node $(\beta, 0)$.

Now we give the details of how, for any $\beta \in \{0,1\}^{<n}$, copies of
$\sigma[\beta]$ are routed to each node in $V[\beta]$. A copy $s$ of
$\sigma[\beta]$ is called a splitter. For technical reasons (e.g., so that we
do not confuse the delays due to key routing with those due to splitter
routing) a splitter is considered a different type of packet from a key.
(N.B. The type {key, splitter} of a packet can be specified by a boolean flag
attached to the packet.) The stage of a splitter $s$ copied from
$\sigma[\beta]$ is fixed to be $|\beta|$ and does not vary during routing.

The splitter routing for $\beta$ begins at root node $r[\beta]$ when
$\sigma[\beta]$ is chosen. Splitters $s$, $s'$ copied from $\sigma[\beta]$ are
transmitted across the two forward edges (i.e., the forward-internal edge and
forward-external edge) departing from $r[\beta]$.

Suppose node $v \in V$ receives a splitter $s$ from an edge $e$ entering
$v$. If $e$ is a forward edge and $\mathrm{stage}(v) > 0$ then $v$ transmits
a copy of $s$ across each of the two forward edges departing $v$. If $e$ is
a forward-external edge, then $v$ transmits splitter $s$ across the
reverse-internal edge departing $v$ in addition. If $e$ is a reverse-internal
edge and $\mathrm{stage}(v) > \mathrm{stage}(s)$ then $s$ is transmitted
across the reverse-internal edge departing $v$.
THEOREM 4.1. Each $v \in V[\beta]$ is activated in
$2(n - |\beta| - \ell(v))$ steps after $r[\beta]$ has been activated (where
$\ell(v) = \max\{\ell \mid 0^\ell$ is a suffix of $\mathrm{address}(v)\}$).

Proof. Recall that $\mathrm{address}(r[\beta]) = \beta 0^{n-|\beta|}$ and
note that $\mathrm{address}(v) = \beta \gamma 0^{\ell(v)}$ where
$\gamma \in \{0,1\}^k$ and $k = n - |\beta| - \ell(v)$. Splitters copied from
$\sigma[\beta]$ are routed from $v_0 = r[\beta]$ on forward edges
$(v_0, v_1), \ldots, (v_{k-1}, v_k)$ and then on reverse-internal edges
$(v_k, v_{k+1}), \ldots, (v_{2k-1}, v_{2k})$. For $i = 1, \ldots, k$ let
$(v_{i-1}, v_i)$ be the forward-internal edge departing $v_{i-1}$ if
$\mathrm{address}(v_{i-1})[|\beta|+i] = \mathrm{address}(v)[|\beta|+i]$ and
otherwise let $(v_{i-1}, v_i)$ be the forward-external edge departing
$v_{i-1}$. Thus
$\mathrm{address}(v_{2k}) = \mathrm{address}(v_k) = \mathrm{address}(v)$ and
$\mathrm{stage}(v_{2k}) = \mathrm{stage}(v_k) - k
= (\mathrm{stage}(v) + k) - k = \mathrm{stage}(v)$. Hence $v_{2k} = v$. □
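
The path used in the proof of Theorem 4.1 can be written out explicitly. The sketch below is a hedged illustration: addresses are encoded as strings of '0' and '1', the function name `splitter_path` is invented here, and the zero suffix $\ell(v)$ is measured only within the part of the address that follows $\beta$ (an assumption made so that $k$ is never negative).

```python
def splitter_path(beta, address):
    """Return the nodes v_0 = r[beta], ..., v_2k = (address, |beta|) on the
    splitter path from the proof of Theorem 4.1, as (address, stage) pairs."""
    n, b = len(address), len(beta)
    assert address[:b] == beta and b < n
    tail = address[b:]
    ell = len(tail) - len(tail.rstrip("0"))    # zero suffix after beta
    k = n - b - ell
    addr, stage = beta + "0" * (n - b), b      # start at the root r[beta]
    path = [(addr, stage)]
    for i in range(1, k + 1):                  # k forward edges
        pos = b + i - 1                        # index of bit |beta| + i
        addr = addr[:pos] + address[pos] + addr[pos + 1:]
        stage = (stage + 1) % n
        path.append((addr, stage))
    for _ in range(k):                         # k reverse-internal edges
        stage = (stage - 1) % n
        path.append((addr, stage))
    return path
```

The path has $2k$ edges, matching the activation delay $2(n - |\beta| - \ell(v))$ stated in the theorem.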


5. THE RANK COMPUTATION

This section describes the rank computation done in phase III of
FLASHSORT.

We show:

THEOREM 5.1. The rank computation can be done in time
$4n + 3 \max_{\beta \in \{0,1\}^n} |X[\beta]|$.

Proof. We begin by sorting the keys of $X[\beta]$ for each
$\beta \in \{0,1\}^n$. This easily can be done in time $2|X[\beta]|$ by a
parallel bubble sort using the cycle of $n$ nodes
$\{(\beta, 0), \ldots, (\beta, n-1)\}$ connected by internal edges.
Next we compute at root $r[\beta]$ the value $|X[\beta]|$, for each
$\beta \in \{0,1\}^{<n}$. This is done in $n$ stages $i = 0, \ldots, n-1$
using the identity $|X[\beta]| = |X[\beta 0]| + |X[\beta 1]| + 1$ for each
$\beta \in \{0,1\}^{<n}$. Note that the two forward edges departing
$r[\beta]$ enter both $r[\beta 0]$ and $r[\beta 1]$, so the roots form a
binary tree of depth $n$. The required sums on this tree can be done in time
$2n$ using carry-adder logic [Tung, 72] to stream the bits of partial sums up
the tree from nodes $r[\beta 0]$ and $r[\beta 1]$ to node $r[\beta]$ for each
$\beta \in \{0,1\}^{<n}$.

Finally, we compute rank by the following rule:

PROPOSITION 5.1. $\mathrm{rank}(\sigma[\Lambda]) = |X[0]| + 1$. For each
$\beta \in \{0,1\}^{<n-1}$,
$\mathrm{rank}(\sigma[\beta 1]) = \mathrm{rank}(\sigma[\beta]) + |X[\beta 10]| + 1$
and
$\mathrm{rank}(\sigma[\beta 0]) = \mathrm{rank}(\sigma[\beta]) - |X[\beta 01]| - 1$.
For each $\beta \in \{0,1\}^{n-1}$, and each $x \in X[\beta 1]$,
$\mathrm{rank}(x) = \mathrm{rank}(\sigma[\beta]) + r + 1$ where
$r = |\{x_1 \in X[\beta 1] \mid x_1 < x\}|$ is the rank of $x$ in
$X[\beta 1]$, and for each $x' \in X[\beta 0]$,
$\mathrm{rank}(x') = \mathrm{rank}(\sigma[\beta]) + r' - |X[\beta 0]|$,
where $r'$ is the rank of $x'$ in $X[\beta 0]$.

The additions required by this final computation can be done in time
$2n + \max_{\beta \in \{0,1\}^n} |X[\beta]|$, again using carry-adder
logic. □
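
Proposition 5.1 can be exercised on a sequential model of the splitter tree. The sketch below is an illustration under assumptions: strings of '0' and '1' index the subsets, the names `split_tree` and `ranks_by_proposition` are invented for the example, and the ranks produced follow the proposition's convention in which the smallest key has rank 1. Empty subsets simply receive no splitter, which corresponds to the blockage case of Section 4 and is tolerated here rather than retried.

```python
import random

def split_tree(keys, n, rng):
    """Build X[beta] for |beta| <= n and sigma[beta] for |beta| < n."""
    X, sigma = {"": list(keys)}, {}
    for i in range(n):
        for beta in [b for b in X if len(b) == i]:
            if X[beta]:
                s = sigma[beta] = rng.choice(X[beta])
                X[beta + "0"] = [x for x in X[beta] if x < s]
                X[beta + "1"] = [x for x in X[beta] if x > s]
            else:
                X[beta + "0"], X[beta + "1"] = [], []
    return X, sigma

def ranks_by_proposition(X, sigma, n):
    """Compute the rank of every key using only the rules of Proposition 5.1."""
    size = lambda b: len(X.get(b, []))
    srank = {"": size("0") + 1} if "" in sigma else {}
    for i in range(n - 1):                       # splitters, top down
        for beta in [b for b in srank if len(b) == i]:
            if beta + "1" in sigma:
                srank[beta + "1"] = srank[beta] + size(beta + "10") + 1
            if beta + "0" in sigma:
                srank[beta + "0"] = srank[beta] - size(beta + "01") - 1
    rank = {sigma[b]: r for b, r in srank.items()}
    for beta in [b for b in srank if len(b) == n - 1]:   # leaf subsets
        for r, x in enumerate(sorted(X[beta + "1"])):
            rank[x] = srank[beta] + r + 1
        for r, x in enumerate(sorted(X[beta + "0"])):
            rank[x] = srank[beta] + r - size(beta + "0")
    return rank
```

Only subset sizes and local ranks within the leaf subsets are used, which is what allows the network to compute ranks with sums streamed along the tree of roots.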

6. CHERNOFF BOUNDS

This section gives some probabilistic inequalities, which will be
useful in the sections following.

Let binomial variable $b$ be the sum of $N$ independent Bernoulli
trials each with success probability $p$. Then the mean of $b$ is
$\bar{b} = Np$ and the probability that $b \ge m$ is
$B(m,N,p) = \sum_{k=m}^{N} \binom{N}{k} p^k (1-p)^{N-k}$. The following
inequality can be derived (see Angluin and Valiant [79]) from the bounds of
Chernoff [52].

LEMMA 6.1. For any $c$, $0 < c < 1$,
$B((1+c)\bar{b}, N, p) \le \exp(-c^2 \bar{b}/2)$, and
$B((1-c)\bar{b} - 1, N, p) \ge 1 - \exp(-c^2 \bar{b}/3)$.

Let $g$ be the sum of $N$ independent geometric random variables
$g_1, \ldots, g_N$ with $\mathrm{Prob}(g_i = k) = p(1-p)^k$ for $k \ge 0$.
Then $g$ has mean $\bar{g} = N(1-p)/p$.

Let $G(m,N,p)$ be the probability $m \ge g$.

LEMMA 6.2. For each $c \ge 1$, if $m = c\bar{g}$ then
$G(m,N,p) \ge 1 - (1 + (c-1)(1-p))^{m+N} / c^m$.

Proof. For any $z \in [1, 1/(1-p))$, Chernoff [52] shows that
$1 - G(m,N,p)$ is upper bounded by $z^{-m}$ times the generating function of
$g$, which is $(p/(1 - (1-p)z))^N$. Our lemma follows from the case
$z = c/(1 + (c-1)(1-p))$, for which $p/(1 - (1-p)z) = 1 + (c-1)(1-p)$. □
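
The bound of Lemma 6.2 can be compared with the exact distribution of $g$, which is negative binomial: $\mathrm{Prob}(g = k) = \binom{k+N-1}{k} p^N (1-p)^k$. The following sketch does this numerically; the function names `G` and `lemma_6_2_bound` and the chosen parameter values are assumptions of the example, and non-integer $m$ is handled by summing up to $\lfloor m \rfloor$.

```python
from math import comb, floor

def G(m, N, p):
    """Exact Prob(g <= m) for a sum g of N independent geometric variables
    with Prob(g_i = k) = p (1-p)^k."""
    return sum(comb(k + N - 1, k) * p ** N * (1 - p) ** k
               for k in range(floor(m) + 1))

def lemma_6_2_bound(c, N, p):
    """Return (m, bound) with m = c * mean and the lower bound of Lemma 6.2."""
    m = c * N * (1 - p) / p
    bound = 1 - (1 + (c - 1) * (1 - p)) ** (m + N) / c ** m
    return m, bound
```

For $c = 1$ the bound is trivially 0; it becomes useful as $c$ grows, which is how the lemma is applied to sums of priorities and delays in later sections.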

7. THE DISTRIBUTION OF KEYS ROUTED IN PHASES I AND II

For each $i = 0, \ldots, n$ and $v \in V_i$, let $X(v,i) \subseteq X$ be
the set of keys visiting node $v$ at stage $i$ of phase II. Fix a constant
$c_1$ satisfying $0 < c_1 < 1$, and let $d_0 = (1 - c_1) c_0$.

Let $\mathcal{E}_0$ be the event $|X(v,0)| \ge d_0 n$ for all $v \in V_0$.

THEOREM 7.1. $\mathrm{Prob}(\mathcal{E}_0) \ge 1 - 2^n f_0(n)$, where
$f_0(n) = \exp(-c_1^2 c_0 n/3)$.

Proof. Fix some $v \in V_0$. Each key $x \in X$ has independent
probability $|V_0|^{-1} = 2^{-n}$ of being routed to node $v$ by phase I.
Thus the event $x \in X(v,0)$ is an independent Bernoulli trial with success
probability $2^{-n}$. $|X(v,0)|$ is therefore a binomial variable with
parameters $|X| = c_0 N$, $2^{-n}$ and mean $c_0 n$. Therefore
$\mathrm{Prob}(|X(v,0)| < d_0 n) = 1 - B(d_0 n, c_0 N, 2^{-n}) \le f_0(n)$
by Lemma 6.1. Hence
$\mathrm{Prob}(\mathcal{E}_0) \ge 1 - |V_0| f_0(n) = 1 - 2^n f_0(n)$, since
$|V_0| = 2^n$. □
Fix constant $d_1 = (1 + c_1) c_0$. Let $\mathcal{E}_1$ be the event
$|X(v,i)| \le d_1 n$ for all $i = 0, \ldots, n$ and $v \in V_i$.

THEOREM 7.2. $\mathrm{Prob}(\mathcal{E}_1) \ge 1 - N f_1(n)$, where
$f_1(n) = \exp(-c_1^2 c_0 n/2)$.

Proof. Fix some $v \in V_i$, where $0 \le i \le n$. Since each key
$x \in X$ is routed in phase I to a random node in $V_0$, its route in phase
II is also random, so the event $x \in X(v,i)$ is upper bounded by an
independent Bernoulli variable with success probability
$|V_i|^{-1} = 2^{-n}$. $|X(v,i)|$ is upper bounded by a binomial variable
with parameters $|X| = c_0 N$, $2^{-n}$ and mean $c_0 n$. Therefore
$\mathrm{Prob}(|X(v,i)| > d_1 n) \le B(d_1 n, c_0 N, 2^{-n}) \le f_1(n)$ by
Lemma 6.1. Hence
$\mathrm{Prob}(\mathcal{E}_1) \ge 1 - |V| f_1(n) = 1 - N f_1(n)$, since
$|V| = N$. □
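
To get a feeling for how large $c_0$ must be before these bounds say anything, the failure terms $2^n f_0(n)$ and $N f_1(n)$ can simply be evaluated. The sketch below takes the expressions for $f_0$ and $f_1$ exactly as stated in Theorems 7.1 and 7.2; the function names and the sample parameters are assumptions of this example.

```python
from math import exp, log

def load_failure_bounds(n, c0, c1):
    """Return (2^n f_0(n), N f_1(n)) with f_0, f_1 as in Theorems 7.1, 7.2."""
    N = n * 2 ** n
    f0 = exp(-c1 ** 2 * c0 * n / 3)
    f1 = exp(-c1 ** 2 * c0 * n / 2)
    return 2 ** n * f0, N * f1

def min_c0_for_theorem_7_1(c1):
    """Threshold on c0 above which 2^n f_0(n) decays as n grows:
    this needs c1^2 * c0 / 3 > ln 2."""
    return 3 * log(2) / c1 ** 2
```

For instance, with $c_1 = 1/2$ the term $2^n f_0(n)$ tends to zero only when $c_0$ exceeds roughly 8.32.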

From Theorems 5.1 and 7.2 it follows

COROLLARY 7.2. The rank computation of phase III takes time at most
$(4 + 3d_1)n$ with probability at least $1 - N f_1(n)$.

8. ACTIVATION PROBABILITIES

Let $\mathcal{A}$ be the event all nodes in $V$ are eventually activated.
Note that event $\mathcal{A}$ holds iff there is no blockage in phase II. Our
calculation of the probability of $\mathcal{A}$ is complicated by the fact
that the routing of a key $x$ stops at root node $r[\beta]$ if
$\sigma[\beta]$ is chosen to be $x$. For each $\beta \in \{0,1\}^{<n}$ where
$|\beta| > 0$ let $r^-[\beta]$ be the node of stage $|\beta| - 1$ such that
there is a forward edge from $r^-[\beta]$ to $r[\beta]$ and the
$|\beta|$-th bit of $r^-[\beta]$ is 1. Since root node $r[\beta]$ has address
$\beta 0^{n-|\beta|}$, we have:
PROPOSITION 8.1. If $x \in X(r^-[\beta], |\beta| - 1)$ then $x$ visits no
root node on any stage $i < |\beta|$.

To simplify our calculation of the probability of $\mathcal{A}$ we make
the following assumption about the procedure for choosing $\sigma[\beta]$. It
ensures that the keys that are candidates for becoming splitters at a root
have never been candidates at previous roots.

A1 For each $\beta \in \{0,1\}^{<n}$, if $|\beta| = 0$ then $\sigma[\beta]$
is chosen to be a random key in $X(r[\Lambda], 0)$ and if $|\beta| > 0$ then
$\sigma[\beta]$ is chosen to be the first key entering $r[\beta]$ from node
$r^-[\beta]$.

THEOREM 8.1. $\mathrm{Prob}(\mathcal{A}) \ge 1 - 2^n f_{\mathcal{A}}(n)$,
where $f_{\mathcal{A}}(n) = (1 - 2^{-n-1})^{c_0 N} \le \exp(-c_0 n/2)$.

Proof. For $i = 0, \ldots, n-1$ and $\beta \in \{0,1\}^i$ let
$\mathcal{A}(\beta, i)$ be the event $X(r[\beta], i) \ne \emptyset$. For each
$\beta \in \{0,1\}^i$ with $1 \le i \le n-1$ and each key $x \in X$, $x$
visits $r^-[\beta]$ with probability $2^{-n}$. Also, $x$ reaches $r[\beta]$
by way of $r^-[\beta]$ with probability $2^{-n-1}$. Hence, for any $\beta$,
$i$ the probability of $\mathcal{A}(\beta, i)$ exceeds

$1 - (1 - 2^{-n-1})^{c_0 N} \ge 1 - f_{\mathcal{A}}(n)$.

This holds even for $\beta = \Lambda$ and $i = 0$, when
$\mathrm{Prob}(\mathcal{A}(\Lambda, 0)) \ge 1 - (1 - 2^{-n})^{c_0 N}
\ge 1 - f_{\mathcal{A}}(n)$.

Hence the probability that all the events $\mathcal{A}(\beta, i)$ occur is
greater than

$1 - \sum_{i,\beta} f_{\mathcal{A}}(n) \ge 1 - 2^n f_{\mathcal{A}}(n)$.

By Theorem 4.1 if all these events occur then all nodes will be activated
eventually. □
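
The inequality $(1 - 2^{-n-1})^{c_0 N} \le \exp(-c_0 n/2)$ with $N = n 2^n$, and the resulting union bound over at most $2^n$ roots, can be checked numerically. The sketch below is illustrative only; the function name `blockage_bound` and the sample values of $n$ and $c_0$ are assumptions.

```python
from math import exp, log1p

def blockage_bound(n, c0):
    """Return (f, cap, union) where f = (1 - 2^{-n-1})^{c0 N},
    cap = exp(-c0 n / 2) and union = 2^n * f, with N = n * 2^n."""
    N = n * 2 ** n
    f = exp(c0 * N * log1p(-2.0 ** (-n - 1)))   # numerically stable power
    cap = exp(-c0 * n / 2)
    return f, cap, 2 ** n * f
```

The inequality follows from $\ln(1 - t) \le -t$, and for moderate $c_0$ the union bound already makes blockage very unlikely even at small $n$.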


9. DELAY SEQUENCES

To simplify the calculation of the total time required in phase II
we make the assumption:

A2 At the start of phase II the keys of $X(v,0)$ are assigned distinct
priorities $\pi = 0, \ldots, |X(v,0)| - 1$ for each $v \in V_0$. Thereafter
in phase II the priority $\pi(x)$ of each key $x \in X$ is fixed. Each active
node $v \in V$ transmits keys of lowest priority first, so that no key $x$ of
priority $\pi(x)$ is transmitted from $v$ before a key $x'$ of priority
$\pi(x') < \pi(x)$ is transmitted from $v$.
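
The local behaviour that assumption A2 asks of a node can be sketched with a binary heap: keys wait while the node is inactive, and once it is active they leave in order of increasing priority. The class name `Node` and its methods are assumptions of this example, which models a single queue and ignores the choice between the two forward edges.

```python
import heapq

class Node:
    """A node that holds keys until activated, then releases them
    lowest priority first (assumption A2)."""

    def __init__(self):
        self.active = False
        self._queue = []                 # heap of (priority, key) pairs

    def receive_key(self, priority, key):
        heapq.heappush(self._queue, (priority, key))

    def activate(self):
        """Called when a splitter of the node's stage arrives."""
        self.active = True

    def transmit(self):
        """Return the next key to send, or None if inactive or empty."""
        if not self.active or not self._queue:
            return None
        return heapq.heappop(self._queue)[1]
```

Note that this local rule only orders the keys present at the node; the analysis later in this section works with a stricter "oracular" version of the rule.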

For each node $v \in V$ and integer $\pi \ge 0$, let the key task
$[v,\pi]$ be the job of transmitting key packets of priority $\pi$ from node
$v$. For each $v \in V$ and $i$, $0 \le i < n$, let the splitter task $(v,i)$
be the job of transmitting a splitter of stage $i$ from node $v$. (Recall
that for a key $x \in X$ visiting node $v$,
$\mathrm{stage}(x) = \mathrm{stage}(v)$, whereas for a splitter $s$ visiting
node $v$, $\mathrm{stage}(s)$ is the stage of the node where the splitter
was created.)

We define a precedence relation $\to$ between these tasks, where
$\tau \to \tau'$ if the completion of some job in $\tau'$ must be delayed
until some job in $\tau$ is completed. For each $u, v \in V$ and
$\pi, \pi', i \ge 0$ we may have:

(1) $[v,\pi] \to [v,\pi+1]$ (since the transmission of a key $x$ of priority
$\pi+1$ may not be done before the transmission from $v$ of keys of
priority $\pi$).

(2) $[u,\pi] \to [v,\pi]$ for each forward edge $(u,v)$ (since 1 step is
required to transmit a key across edge $(u,v)$ and the key's priority does
not change).
(3) $[u,\pi] \to (v,i)$ where $v$ is a root node $r[\beta]$,
$u = r^-[\beta]$ for some $\beta \in \{0,1\}^i$ (since a key $x$ of priority
$\pi$ may be chosen $\sigma[\beta] = x$).

(4) $(u,i) \to (v,i)$ where $(u,v)$ is an edge of the $\mathrm{CCC}_n^+$
network (since a splitter may be routed on any of the types of edges on the
$\mathrm{CCC}_n^+$).

(5) $(u,i) \to [v,\pi]$ where $(u,v)$ is a reverse-internal edge and
$\mathrm{stage}(v) = i$ (since a key of any priority $\pi$ cannot be
transmitted from $v$ until node $v$ has been activated by reception of a
splitter of stage $i$ from a node $u$).

(6) $[u,\pi'] \to [v,\pi]$ where $v$ is a root node $r[\beta]$,
$u = r^-[\beta]$ for some $\beta \in \{0,1\}^i$ (since a key of any priority
$\pi$ cannot be transmitted from $v$ until a splitter has been created at
$v$).

Let $\delta = \delta_0 \hat{\delta}_0 \cdots \delta_{n-1} \hat{\delta}_{n-1}$
be a delay sequence if for $i = 0, \ldots, n-1$, $\delta_i$ is a (possibly
empty) sequence of key tasks of stage $i$ which are related by $\to$, and
$\hat{\delta}_i$ is a (possibly empty) sequence of splitter tasks of stage
$i$ related by $\to$. Necessarily $\hat{\delta}_i$ is of the form:

$(v_0, i), \ldots, (v_k, i), \ldots, (v_{2k-1}, i)$

where $i = \mathrm{stage}(v_0)$, $i+1 = \mathrm{stage}(v_{2k-1})$,
$(v_j, v_{j+1})$ is a forward edge for $j = 0, \ldots, k-1$,
$(v_{k-1}, v_k)$ is a forward-external edge, and
$(v_k, v_{k+1}), \ldots, (v_{2k-2}, v_{2k-1})$ are reverse-internal edges.

With each execution of phase II we can associate the set of delay
sequences that describe the temporal sequences of causality in the obvious
way. Thus two tasks are adjacent in such a sequence if the second one could
not have been begun a time unit earlier than it was because the first task
had not been completed yet. Our analysis will assume an "oracular"
version of the algorithm that only starts executing $[u,\pi]$ when all the
keys of priority smaller than $\pi$ that are to pass through $u$ have already
done so. The reader can verify by a little reflection that our analysis
for the oracular algorithm upper bounds the performance of the actual
algorithm. Another way is by augmenting the algorithm to enforce
priorities as suggested by Upfal [82].
10. UPPER BOUNDS ON THE LENGTH AND NUMBER OF DELAY SEQUENCES

Let $\Delta$ be the set of all delay sequences occurring in an execution of
FLASHSORT. Fix a constant $c_2$ satisfying $0 \le c_2 < \min(1, d_0 - 4)$.

Let $\mathcal{E}_2$ be the event $|\delta| \le (c_2 + 4)n$ for all
$\delta \in \Delta$.

THEOREM 10.1. Assuming $\mathcal{E}_2$, $|\Delta| \le 2^{d_2 n}$, where
$d_2 = 1 + \lceil (c_2 + 4) \log 6 \rceil$.

Proof. Note that each $\delta \in \Delta$ can be completely specified by the
start node in $V_0$, and a binary sequence of length
$\lceil |\delta| \log 6 \rceil$ (since there are 6 types of pairs of tasks
related by $\to$). By assumption $\mathcal{E}_2$,
$|\Delta| \le |V_0| \, 2^{\lceil (c_2+4) n \log 6 \rceil} \le 2^{d_2 n}$. □

The Lemmas 10.1 and 10.2 proved in this section imply:

THEOREM 10.2.
$\mathrm{Prob}(\mathcal{E}_2 \mid \mathcal{E}_0 \wedge \mathcal{A})
\ge 1 - 2^{d_2 n} f_2(n)$, where
$f_2(n) = (1 + (c_2 - 1)\exp(-1/2))^{(c_2+1)n} / c_2^{\,c_2 n}$.

LEMMA 10.1. With certainty, $\sum_{i=0}^{n-1} |\hat{\delta}_i| \le 2n$ for
any delay sequence
$\delta = \delta_0 \hat{\delta}_0 \cdots \delta_{n-1} \hat{\delta}_{n-1}$.

Proof. Recall from Section 4 that $\ell(v)$ is the length of the longest
suffix of $\mathrm{address}(v)$ in $0^*$. If $(u,v)$ is a forward edge and
$\mathrm{stage}(u) < n-1$, then $\ell(v) \le \ell(u)$. Thus $\ell(v)$ never
increases from successive key tasks $[u,\pi] \to [v,\pi']$ appearing in any
$\delta_i$. Let $\hat{\delta}_i = (v_0, i) \ldots (v_{2k-1}, i)$.
By Theorem 4.1, $\ell(v_0) \ge \ell(v_{2k-1}) + |\hat{\delta}_i|/2$. This
implies $\sum_{i=0}^{n-1} |\hat{\delta}_i|/2 \le n$. □

For any $\beta \in \{0,1\}^n$, let $\beta_i$ be the prefix of $\beta$ of
length $i$. Let $\mathcal{E}_p$ be the event:

$\sum_{i=0}^{n-1} \pi(\sigma[\beta_i]) \le (c_2 + 1)n$, for all
$\beta \in \{0,1\}^n$.
LEMMA 10.2.
$\mathrm{Prob}(\mathcal{E}_p \mid \mathcal{E}_0 \wedge \mathcal{A})
\ge 1 - 2^n f_2(n)$.

Proof. Consider some fixed $\beta \in \{0,1\}^n$. Recall by assumption A1
that for each $i = 1, \ldots, n-1$ the root $r[\beta_i]$ chooses
$\sigma[\beta_i]$ to be the first key entering $r[\beta_i]$ from node
$r^-[\beta_i]$, and the $i$-th bit of $\mathrm{address}(r^-[\beta_i])$ is 1.
Thus the set
$U_i = \{(\gamma 1 0^{n-i}, 0) \mid \gamma \in \{0,1\}^{i-1}\}$ consists of
those nodes of $V_0$ for which there is a path of $i-1$ forward edges to
$r^-[\beta_i]$. Note that $U_i \cap U_j = \emptyset$ for $i \ne j$. By
assumption $\mathcal{E}_0$, $|X(v,0)| \ge d_0 n$ for all $v \in V_0$. Let
$X_{i,j} = \{x \in X(v,0) \mid v \in U_i$ and $x$ has priority $j\}$.
Since each key in $X(v,0)$ has distinct priority, $|X_{i,j}| = |U_i|$. Each
$x \in X_{i,j}$ has independent probability $|U_i|^{-1}$ of getting routed to
$r^-[\beta_i]$ and probability 1/2 of transmission from $r^-[\beta_i]$ to
$r[\beta_i]$. Hence for each $x \in X_{i,j}$ the event $x$ visits
$r[\beta_i]$ is lower bounded by an independent Bernoulli variable with
success probability $|U_i|^{-1}/2$.

Let $\pi_i = \pi(\sigma[\beta_i])$. For each $j$, $0 \le j < d_0 n$,

$\mathrm{Prob}(\pi_i = j \mid \pi_i \ge j)
= \mathrm{Prob}(\exists x \in X_{i,j}$ where $x$ visits $r[\beta_i])
\ge 1 - (1 - |U_i|^{-1}/2)^{|U_i|} \ge 1 - \exp(-1/2)$.

This implies $\mathrm{Prob}(\pi_i \ge j) \le (1-p)^j$ where
$p = 1 - \exp(-1/2)$. Hence for each $i = 0, \ldots, n-1$ the priority
$\pi_i$ is upper bounded by an independent geometric variable with parameter
$p$. We therefore can apply Lemma 6.2. □


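LEMMA 10.3. Assuming $\mathcal{E}_p$,
$\sum_{i=0}^{n-1} |\delta_i| \le (c_2 + 2)n$ for each $\delta \in \Delta$
where
$\delta = \delta_0 \hat{\delta}_0 \cdots \delta_{n-1} \hat{\delta}_{n-1}$.

Proof. For $i = 0, \ldots, n-1$, let $\pi_i$ be the priority of the first
task in $\delta_i$. Then

$\sum_{i=0}^{n-1} |\delta_i| \le n + \sum_{i=0}^{n-1} \pi_i
\le n + (c_2 + 1)n$.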
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11.  UPPER  BOUNDS  ON  EXECUTION  TIME  OF  PHASE  II 

Let  T  be  the  execution  time  of  Phase  II;  we  wish  to  derive  on 
upper  bounds  for  T,  which  hold  with  high  likelihood.  For  each  delay 
sequence  6€A,  let  T(6)  be  the  time  to  execute  the  tasks  of  6.  Then 

T  *  max  (T(6)) 

6€A 

For  key  x  €  X,  let  Px  be  the  sequence  of  nodes  visited  by  x  while 
stage  (x)  <n.  For  each  delay  sequence  6  6  A,  let  X^  «  {x  €  x|  v  €  Px» 

IT  *  IT (x)  and  [v,tt]  €  5};  this  is  the  set  of  keys  whose  transmission 

are  tasks  of  6. 

Fix a constant $c_3$ satisfying $0 < c_3 \le 1$, and let $d_3 = (c_3+1)(c_2+4)$.

Let $\mathcal{E}_3$ be the event $|X_\delta| \le d_3 n$ for all $\delta \in \Delta$.

LEMMA 11.1. $\mathrm{Prob}(\mathcal{E}_3 \mid \mathcal{E}_2) \ge 1 - 2^{d_2 n} f_3(n)$, where $f_3(n) = \exp(-c_3^2 (c_2+4) n)$.

Proof. Consider any key task $[v,\pi] \in \delta$ and key $x \in X$ with priority$(x) = \pi$. The event that $x$ visits $v$ in Phase II is upper-bounded by a Bernoulli variable with success probability $2^{-n}$, independent of any other key. But there are at most $2^n$ keys in $X$ of any given priority. Thus $|X_\delta|$ is upper-bounded by a binomial random variable with parameters $2^{-n}$ and $2^n |\delta|$. The latter is less than $2^n (c_2+4) n$ by assumption $\mathcal{E}_2$, so by Lemma 6.1, $\mathrm{Prob}(|X_\delta| > d_3 n \mid \mathcal{E}_2) < f_3(n)$. Hence $\mathrm{Prob}(\mathcal{E}_3 \mid \mathcal{E}_2) \ge 1 - |\Delta| f_3(n) \ge 1 - 2^{d_2 n} f_3(n)$ by Theorem 10.1. □

For each delay sequence $\delta \in \Delta$ and key $x \in X$, let $\tau(\delta,x) = \{[v,\pi] \in \delta \mid \pi = \pi(x) \text{ and } v \in P_x\}$. Since exactly one step is required to process key $x$ for each task $[v,\pi] \in \tau(\delta,x)$, we have

$$T(\delta) = \sum_{x \in X_\delta} |\tau(\delta,x)|.$$

Fix a constant $c_4 \ge 1$.

THEOREM 11.2. $\mathrm{Prob}(T \le c_4 n \mid \mathcal{E}_2 \wedge \mathcal{E}_3) \ge 1 - 2^{d_2 n} f_4(n)$, where

$$f_4(n) = c_4^{-c_4 n} \left(1 + (c_4 - 1)/2\right)^{(c_4 + d_3) n}.$$

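Proof. Let $\mathcal{E}_4$ be the event $T(\delta) \le c_4 n$ for all $\delta \in \Delta$. Note that $\mathcal{E}_4$ implies $T \le c_4 n$.

For each $x \in X_\delta$, we claim $|\tau(\delta,x)|$ is upper bounded by an independent geometric random variable with parameter $1/2$. Given $\mathcal{E}_3$, $T(\delta) = \sum_{x \in X_\delta} |\tau(\delta,x)|$ is a sum of at most $d_3 n$ such variables, so we can apply Lemma 6.2 to bound $\mathrm{Prob}(T(\delta) \le c_4 n \mid \mathcal{E}_3) \ge 1 - f_4(n)$, and by Theorem 10.1,

$$\mathrm{Prob}(\mathcal{E}_4 \mid \mathcal{E}_2 \wedge \mathcal{E}_3) \ge 1 - |\Delta| f_4(n) \ge 1 - 2^{d_2 n} f_4(n).$$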

Now we prove our claim. Fix some $x \in X_\delta$. Let $P_x = v_0,\ldots,v_m$ and $\pi = \pi(x)$. For each $i = 0,\ldots,m-1$, let $(v_i,u_i)$ be the forward-internal edge departing $v_i$ and let $(v_i,w_i)$ be the forward-external edge departing $v_i$. Note that $u_i \in P_x$ iff $w_i \notin P_x$. The event $u_i \in P_x$, given $[v_i,\pi] \in \tau(\delta,x)$, has independent probability $1/2$ for each $i = 0,\ldots,m-1$.

Furthermore, if $v_i \in P_x$ and $[v_i,\pi] \in \tau(\delta,x)$ but $[v_{i+1},\pi] \notin \tau(\delta,x)$, then $[v_j,\pi] \notin \tau(\delta,x)$ for $j = i+1,\ldots,m$. Hence for $k \ge 0$, $\mathrm{Prob}(|\tau(\delta,x)| = k) \le 2^{-k-1}$, as claimed. □
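
To see how quickly the tail of a sum of such geometric variables decays, the following sketch computes the tail exactly rather than through the closed-form bound of Lemma 6.2. It assumes the convention $\mathrm{Prob}(g = k) = 2^{-(k+1)}$ for $k = 0, 1, 2, \ldots$; the function name and the sample sizes are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from math import comb

def geometric_sum_tail(m: int, t: int) -> Fraction:
    """Exact Prob(g_1 + ... + g_m > t) for independent g_i with
    Prob(g_i = k) = 2^-(k+1), k = 0, 1, 2, ... (assumed convention)."""
    # The sum S is negative binomial: Prob(S = s) = C(s+m-1, m-1) / 2^(s+m).
    at_most_t = sum(Fraction(comb(s + m - 1, m - 1), 2 ** (s + m))
                    for s in range(t + 1))
    return 1 - at_most_t

# Hypothetical setting: m variables (keys in X_delta), threshold t = c * m.
m = 20
for c in (1, 2, 4, 8):
    print(f"c={c}  Prob(sum > {c * m}) = {float(geometric_sum_tail(m, c * m)):.3e}")
```

Each variable has mean 1 under this convention, so the sum concentrates near $m$ and the tail shrinks rapidly once the threshold is a constant multiple of $m$. This is the behaviour the theorem relies on when it takes a union bound over all delay sequences.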

Finally, we have:

THEOREM 11.3. For $c_0$ greater than a small constant, there exists $c' > 1$ such that $\mathrm{Prob}(T \le c_4 n) \ge 1 - N^{-c'}$.
12.  BLOCKADE  AVOIDANCE


We prove in Theorem 8.1 that the probability of blockage is vanishingly low if $c_0$, the number of keys initially input at each node, is greater than a small constant, say $h$. If $1 \le c_0 < h$ then we can initially route all keys at random to a subnetwork of the CCC$_n$ with node set $V' = V[0^{\lceil \log h \rceil}]$ of cardinality $\le N/h$. Then each node in $V'$ will have on average at least $h$ packets, and we can execute FLASHSORT on this subnetwork.
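
The arithmetic behind this concentration step can be sketched as follows. The function name, the base-2 logarithm, and the assumption that exactly a $2^{-k}$ fraction of the $N$ nodes have an address beginning with $k$ zeros are ours, made for illustration only.

```python
def subnetwork_plan(N: int, c0: int, h: int):
    """Return (prefix_len, subnetwork_size, average_load) when all c0*N keys
    are concentrated on the nodes whose address starts with prefix_len zeros.

    Assumes log is base 2 and that N is divisible by 2**prefix_len.
    """
    prefix_len = (h - 1).bit_length()   # exact ceil(log2(h)) for h >= 1
    size = N // (2 ** prefix_len)       # nodes sharing the all-zero prefix
    average_load = c0 * N / size        # keys per node in the subnetwork
    return prefix_len, size, average_load

# Hypothetical example: N = 10 * 2**10 nodes, one key per node, threshold h = 5.
N, c0, h = 10 * 2 ** 10, 1, 5
prefix_len, size, load = subnetwork_plan(N, c0, h)
assert size <= N / h and load >= h
print(prefix_len, size, load)
```

Rounding $\log h$ up to an integer can shrink the subnetwork by up to a further factor of two, which only increases the average load per node.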


A more practical method is to modify our FLASHSORT algorithm so that harmful blockage never occurs. Note that if $\sigma[\beta]$ is not defined because $X[\beta]$ is empty, then the rank computation of Section 5 is still valid. The harmful case is when $\sigma[\beta]$ is not defined though $X[\beta]$ is nonempty. The modified algorithm avoids this case altogether by making each member of $X[\beta]$ a viable candidate for becoming $\sigma[\beta]$. We define a new type of packet which we call a candidate and which, for routing purposes, is considered to be distinct from a key or a splitter. We route candidate packets for $\sigma[\beta]$ along essentially the reverse of the paths taken by the splitters derived from $\sigma[\beta]$. Suppose a node $u \in V[\beta]$ receives a candidate packet $t$ of stage $|\beta|$. If $u$ is active or has previously received any key, splitter or candidate packet of stage $|\beta|$, then the current candidate packet is deleted. If $u$ is not the root node $r[\beta]$, then the entering candidate packet $t$ is transmitted across departing edge $e$ (where if address$(u)$ has suffix $0^{n-|\beta|}$, then $e$ is the departing reverse-internal edge, and if address$(u)$[stage$(u)+1$] $= 0$ then $e$ is the departing forward-internal edge, and otherwise $e$ is the departing forward-external edge). Finally, if $u$ is the root node $r[\beta]$, then $\sigma[\beta]$ is chosen to be the entering candidate $t$ of stage $|\beta|$, and the splitter-creation process then proceeds as described in Section 4. It is easy to verify, using a proof similar to that of Theorem 4.1, that if $X[\beta] \neq \emptyset$, then a candidate for stage $|\beta|$ reaches the root $r[\beta]$. Furthermore, an argument similar to Lemma 10.1 shows that this candidate routing requires additional time at most $2n$ on any delay path.
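
The per-node handling of a candidate packet described above can be sketched as a single decision function. This is a minimal illustration under stated assumptions, not the implementation used in the paper: addresses are modelled as bit strings of length $n$, address positions are taken to be 1-based (so position stage$(u)+1$ is Python index `stage`), and the names `Action` and `candidate_action` as well as the boolean flags are hypothetical.

```python
from enum import Enum

class Action(Enum):
    DELETE = "delete"                      # drop the candidate packet
    BECOME_SPLITTER = "become_splitter"    # root adopts it as sigma[beta]
    REVERSE_INTERNAL = "reverse_internal"
    FORWARD_INTERNAL = "forward_internal"
    FORWARD_EXTERNAL = "forward_external"

def candidate_action(address: str, stage: int, beta_len: int,
                     is_root: bool, is_active: bool,
                     seen_stage_packet: bool) -> Action:
    """Decide what node u does with an entering candidate of stage |beta|.

    address           -- bit string of length n (assumed representation)
    stage             -- stage(u); must satisfy 0 <= stage < n
    beta_len          -- |beta|
    seen_stage_packet -- u already received a key, splitter or candidate
                         packet of stage |beta|
    """
    n = len(address)
    if is_active or seen_stage_packet:
        return Action.DELETE
    if is_root:
        return Action.BECOME_SPLITTER
    if address.endswith("0" * (n - beta_len)):
        return Action.REVERSE_INTERNAL
    if address[stage] == "0":  # address(u)[stage(u)+1] with 1-based positions
        return Action.FORWARD_INTERNAL
    return Action.FORWARD_EXTERNAL

# Hypothetical node with n = 4 and |beta| = 2 whose address ends in "00".
print(candidate_action("0100", 1, 2, False, False, False))
```

The checks are applied in the order the text gives them: deletion first, then the root case, then the three edge choices. Whether a node counts as active and how stages are stored are left to the caller, since the section does not fix a data layout.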


13.  FURTHER  WORK

A further paper, to appear, describes the implementation of FLASHSORT on the shuffle-exchange network, multi-dimensional arrays, grids, and also hybrid networks of grids with CCC subnetworks. For these hybrid networks, FLASHSORT has the asymptotically optimal VLSI bit-complexity $AT^2 = O(N^2 \log^2 N)$ for sorting $N$ keys (represented in binary) within time $T$ (with high likelihood) and area $A \le O(N^2)$.